Experiments in Data Compression
by John Kormylo
E.L.F. Software
Introduction
When first developed, LZA compression was superior to that
obtained using ZIP or LZH. Since then, both ZIP and LZH have
improved their compression efficiency dramatically. This report
documents attempts to improve LZA to regain its advantage. While
it is possible to out-perform ZIP consistently, the resulting
algorithm is glacially slow.
LZA Description
The LZA algorithm is a variant of LZSS combined with adaptive
arithmetic coding (see "The Data Compression Book" by Mark
Nelson). It differs from standard adaptive arithmetic coding by
using a "new code" character. It differs from LZSS by using
additional codes to transmit match tokens.
The "new code" character is followed by a (relatively)
uncompressed character which had previously been assigned a count
of zero. Adaptive arithmetic normally assigns a minimum count of
one to all 256 characters, which makes the coding inefficient
when only a few characters are used (e.g., text files). The
overhead for entering the character set using the "new code"
character is comparable to the overhead for transmitting count
information for a non-adaptive method.
The "new code" character always has a count of one (until
no new codes are left.) The new codes are entered by dividing
the current range (high - low) into 'n' segments of equal size,
where 'n' is the number of remaining characters with a count of
zero.
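As an illustration, coding a character through the "new code"
escape might look something like the following C sketch (the
names and structure are illustrative, not taken from the actual
LZA source):

    #define NSYMS 256

    unsigned long low, high;      /* current coder range          */
    unsigned long count[NSYMS];   /* adaptive counts, initially 0 */

    /* Narrow the range to the segment for unseen character 'c'. */
    void encode_new_char(int c)
    {
        unsigned long range = high - low;
        unsigned long n = 0, rank = 0;
        int i;

        /* n = characters still unseen; rank = position of 'c'
           among them. */
        for (i = 0; i < NSYMS; i++) {
            if (count[i] == 0) {
                if (i < c)
                    rank++;
                n++;
            }
        }
        /* Divide the range into n equal segments and pick
           segment number 'rank'. */
        high = low + range / n * (rank + 1);
        low  = low + range / n * rank;
        count[c] = 1;     /* the character is now in the model */
    }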
LZSS adds an extra bit before every character or match token
to indicate which is being used. LZA introduces an additional
character which is followed by the match token. If there are
exactly the same number of tokens as characters transmitted, then
this additional character will require only one bit to be coded
and every other character will require one additional bit to be
coded. For any ratio other than one to one, using this special
character is more efficient than using a single bit.
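The arithmetic behind this claim is easy to check: coded
adaptively, the escape costs about the binary entropy H(f) bits
per item, where f is the fraction of items that are match
tokens, while a flag bit always costs exactly one bit, and
H(f) < 1 everywhere except f = 0.5. The following small C
program (an illustrative check, not from the original report)
makes the comparison concrete:

    #include <math.h>
    #include <stdio.h>

    /* Binary entropy in bits: per-item cost of distinguishing
       tokens from characters with an ideal arithmetic coder. */
    double h(double f)
    {
        return -f * log2(f) - (1.0 - f) * log2(1.0 - f);
    }

    int main(void)
    {
        double f;
        for (f = 0.1; f < 1.0; f += 0.2)
            printf("token fraction %.1f: escape %.3f bits, "
                   "flag 1 bit\n", f, h(f));
        return 0;
    }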
Actually, LZA adds 64 additional characters corresponding to
64 possible match lengths. This takes advantage of the fact that
short matches are far more common than long matches by assigning
a count to each match length. Also, since matches longer than 32
bytes are rare, assigning an initial count of zero and using the
"new code" character turns out to be very effective.
Past locations in the message are organized into a self-
balancing binary search tree for faster matching.
LZA Improvements
Not all past locations should be included in the binary
search tree. Nelson mentions that, for speed, it is common not
to include the locations skipped over when a match is found. As
it turns out, the probability of matching at one of these
locations is so low that leaving them out of the search tree
improves the compression as well.
This is not the case, however, with the location matched.
While there is already another location in the search tree which
matches the first few characters, the probability of finding a
longer match using the new location is high enough to justify
its inclusion.
Another improvement can be made by keeping a count for how
often each location is used. Rather than dividing the current
range into segments of equal size for each location in the search
tree, the size of the segments can be made proportional to this
count (set to one initially). Also, when the tree is full, the
new LZA does not throw out old entries with counts greater than
one. Instead it decrements the count and allows it to cycle
through again (until the count goes to zero).
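In C, count-proportional index coding might be sketched as
follows (the Entry structure and names are illustrative
assumptions):

    typedef struct {
        long pos;          /* location in the message   */
        unsigned count;    /* times this entry was used */
    } Entry;

    /* Narrow the coder range to the segment for entry 'idx',
       with segment sizes proportional to the usage counts. */
    void encode_index(Entry *tree, int nentries, int idx,
                      unsigned long *low, unsigned long *high)
    {
        unsigned long range = *high - *low;
        unsigned long total = 0, below = 0;
        int i;

        for (i = 0; i < nentries; i++) {
            if (i < idx)
                below += tree[i].count;
            total += tree[i].count;
        }
        *high = *low + range / total * (below + tree[idx].count);
        *low  = *low + range / total * below;
        tree[idx].count++;    /* adapt to this use */
    }

The full loop over the tree is exactly the kind of summation
that the refinement described next avoids.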
For speed, LZA keeps a table of search tree indexes sorted
by count (in decreasing order), plus an array containing the
number of entries with counts 1 through 16. The sum of all
counts up to an entry with a count of one is given by the sum of
all counts (precomputed) minus the number of entries after the
index (times one). This is much faster than summing up to 16K
numbers.
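In code, the shortcut reduces to a constant-time lookup (an
illustrative sketch):

    /* Entries are sorted by count in decreasing order, so every
       entry after the first count == 1 entry also has count 1.
       The prefix sum up to such an entry is therefore the
       precomputed total minus one for each entry after it. */
    unsigned long cum_count(unsigned long total, /* sum of counts */
                            int nentries,        /* table size    */
                            int idx)             /* count == 1    */
    {
        return total - (unsigned long)(nentries - 1 - idx);
    }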
Also, if a location is used more than once, odds are that it
will use the same length as before. Therefore, one of the length
codes means "use the same length as last time." (It replaced the
length = 2 code, which was never used in practice.) Since a
location must have been used before for this length to be
defined, only those search tree entries with counts greater than
one are used when computing the index code.
While searching through the binary tree will easily locate
the longest match, it will not necessarily locate the best match
of that length (the one with the highest count). Any match with
the same length as the one found will be in the sub-tree starting
with the entry found, but one must search both branches to find
all of them. This can be done using a recursive function to
search both branches whenever a duplicate match is found.
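Such a walk might be sketched in C as follows (match_len is an
assumed helper returning the common prefix length of the message
at two positions):

    typedef struct Node {
        long pos;                  /* candidate match location */
        unsigned count;            /* usage count              */
        struct Node *left, *right;
    } Node;

    extern int match_len(long a, long b);   /* assumed helper */

    /* Starting from the node where the longest match was found,
       visit both branches whenever a duplicate match appears and
       keep the equal-length match with the highest count. */
    void best_of_length(Node *n, long here, int len, Node **best)
    {
        if (n == NULL)
            return;
        if (match_len(n->pos, here) >= len) {
            if (*best == NULL || n->count > (*best)->count)
                *best = n;
            best_of_length(n->left, here, len, best);
            best_of_length(n->right, here, len, best);
        }
    }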
Tree Size Experiments
Text files generally do best with a very large tree (16K
entries or more). Programs do better with smaller trees. A
sorted text file (actually, a dictionary) worked best with a 16
entry search tree.
The following table shows the compressed file size for three
test datasets as a function of tree size. For comparison the
corresponding ZIP compressed file sizes are included, and a '>'
marks the best single-tree result in each column. (Two of these
files can be downloaded from Genie, and one is from the Atari
developer's package.)
One way to improve performance over a wide range of tree
sizes is to use more than one tree at a time and include a code
to select which tree to use. Best results were obtained using 3
trees: sizes 16, 512 & 16K. These results are included in the
following table (as "3 trees").
Actually, there are many ways to do this. One way would be
to use a different group of length codes for each tree. Another
would be to use regular codes to indicate the tree selected and a
separate code set (like the index code) for the length indicator.
However, best results were obtained using the same length codes
as before and a separate set of codes for the tree selection
indicator.
tree | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
16 | 45055 | 46100 | > 476285 |
32 | 44186 | 43144 | 488860 |
64 | 40211 | 41018 | 506394 |
128 | 37345 | 38961 | 521958 |
256 | 34163 | 36954 | 539121 |
512 | 31198 | 35443 | 559172 |
1024 | 28756 | 34483 | 582964 |
2048 | 27039 | 33763 | 612184 |
4096 | 25765 | > 33673 | 639819 |
8192 | 25114 | 33822 | 658753 |
16384 | > 24942 | 34003 | 670774 |
-------+--------------+-------------+--------------+
ZIP | 25207 | 32444 | 496880 |
-------+--------------+-------------+--------------+
3 trees| 24928 | 33073 | 559810 |
avg | 24638 | 33139 | 556216 |
-------+--------------+-------------+--------------+
gaps | 28202 | 33867 | 471766 |
trim | 24910 | 33163 | 531371 |
best | 22828 | 32191 | 457176 |
It is interesting to note that this method out-performed the
best single tree size for two of the three files. Also, the
resulting counts showed that all three trees were used
frequently, and that the smaller trees were used far more often
than would be indicated by their relative sizes.
Up to this point, comparisons between matches of different
lengths were made by adding the cost of the missing characters to
the shorter of the two matches. However, it is possible that
more than one short match could replace a long match. To
overcome this problem, one can compare the average cost per
character for the two matches. Specifically, this is computed
using
avg = log(4294967296.0 / size) / len
where 'len' is the number of characters and 'size' is the
segment size (starting from 4294967296) that results from
arithmetically coding the length, tree, and index codes.
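Restated as a small C helper (not code from LZA itself):

    #include <math.h>

    /* Average coded cost per character covered by a match.
       'size' is the segment size left after coding the length,
       tree, and index codes, out of a full range of 2^32; the
       log base only matters as a common scale factor. */
    double avg_cost(double size, int len)
    {
        return log(4294967296.0 / size) / len;
    }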
This not only works well for selecting which tree to use,
but also whether to use a match at all (by comparing the average
cost per character to the cost of the first character). The
results are shown in the first table (as 'avg').
The most surprising thing is that this approach performed so
poorly on WORDLIST.TXT relative to how it performed using only
the 16 entry tree. A test was run which used the 16 entry tree
wherever possible, and the other trees only when no match was
found. The resulting compressed dataset was 556869 bytes long.
It was also discovered that the 'avg' method used the 16
entry tree 93278 times and the "use 16 if possible" test used it
99475 times, yet the 16 entry tree running alone found 164708
matches. The extra matches must have come from INSIDE the matches
found in the 512 and 16K entry trees. Obviously, no "greedy"
algorithm was going to find those matches.
Another test was run which only looked for matches from the
large search trees to fill in the gaps between matches with the
small trees. The results are shown in the first table (as
'gaps'). While this worked well for WORDLIST.TXT, it performed
very poorly on DEMONSTR.DOC.
A similar test was run which used shorter matches when the
longest match found extended past the start of a match from the
16 entry tree. The results are shown in the first table (as
'trim'). This performed well on all test files.
While these tests were not good enough, they did show that
using three trees could give acceptable results if the proper
matches were used. However, to find these matches it would be
necessary to test all possible ways to compress the data. The
algorithm tried would check all possible combinations of matches
and characters until a unique first step was found, 32 bits of
compressed output were produced, or 256 bytes of input data were
compressed (whichever came first). After the first step was
implemented, the process was started all over again since the
counts and search trees had been modified and the old analysis
was no longer valid. The results are shown in the first table
(as 'best'). The only problem with this method is that it is
VERY slow.
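The stopping rule for this full search might be sketched in C as
follows (Choice and next_choices are assumed helpers; costs are
measured in output bits):

    typedef struct {
        int len;        /* input bytes this choice consumes */
        double cost;    /* output bits this choice adds     */
    } Choice;

    #define MAX_CHOICES 64

    /* Assumed helper: fill 'out' with every match available at
       'pos' plus the single-character choice; returns how many. */
    extern int next_choices(long pos, Choice *out);

    /* Recursively cost every combination of choices starting at
       'pos', cutting a branch off at 32 bits of output or at the
       input limit (256 bytes past the starting point). */
    double search_cost(long pos, long limit, double bits)
    {
        Choice ch[MAX_CHOICES];
        double best = 1e30, c;
        int i, n;

        if (bits >= 32.0 || pos >= limit)
            return bits;
        n = next_choices(pos, ch);
        for (i = 0; i < n; i++) {
            c = search_cost(pos + ch[i].len, limit,
                            bits + ch[i].cost);
            if (c < best)
                best = c;
        }
        return best;
    }

The caller would keep only the first step of the cheapest path,
emit it, and (as described above) restart the whole analysis,
since the counts and search trees change after every step.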
Finding Contiguous Matches
In this section we will attempt to equal the 'best'
algorithm using another approach.
If a match can be found which extends past the end of a
larger match, but no match can be found at the end of the larger
match, then the location inside the smaller match corresponding
to the end of the larger match must have been skipped. Therefore
by including the skipped locations in (some) of the search trees,
one should be able to locate more contiguous matches.
The following table shows the compressed file sizes using a
single search tree which includes all locations (no skipping).
These do not perform nearly as well as the corresponding trees in
the first table.
If we combine the 3 trees as before, the results are
surprisingly respectable. The '0,3' values use no skipping in
all three trees (16, 512, & 16K). The '1,2' values use skipping
in the 16K entry tree but not in the other two. The '2,1' values
use skipping in all but the 16 entry tree.
tree | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
16 | 45334 | 47560 | 623420 |
512 | 39661 | 40590 | 620509 |
16384 | 34762 | 37505 | 696476 |
-------+--------------+-------------+--------------+
0,3 | 25401 | 32829 | 516093 |
1,2 | 24570 | 33000 | 514302 |
2,1 | 24641 | 33163 | 511917 |
-------+--------------+-------------+--------------+
3,1 | 24612 | 33094 | 507442 |
3,2 | 24457 | 32843 | 504815 |
-------+--------------+-------------+--------------+
mixed | 24261 | 32854 | 503976 |
Even better results can be obtained using additional trees
which contain only the (previously) skipped locations. The '3,1'
values use the previous 3 trees (16, 512 & 16K) plus an
additional 16 entry tree which contains ONLY skipped locations.
The '3,2' values use the same plus an additional 512 entry tree
which also contains ONLY skipped locations.
The best results for this approach were obtained using a
'3,2' tree combination as before, but now the first 16 entry tree
contains ONLY the matched locations while the second 16 entry
tree contains ALL locations. These results are shown in the
above table (as 'mixed').
Biting the Bullet
While the results in the second table are interesting, the only way
to consistently out-perform ZIP is to use the full search
('best') algorithm. In this section we will be concerned with
improving this approach.
There are several reasons why this approach is more time
consuming. First, it must search for a match at every location.
Second, it must search for the best match for each possible
length. The time required to perform this search is roughly
proportional to how many matches are found. Finally, it must
repeat some of these searches, since the results are thrown away
once the next code is determined.
In the original algorithm, the limiting factor was almost
always the 32 bits of output. In fact, one can reduce the
maximum number of input bytes to 65 with no penalty at all.
Further reduction requires also reducing the maximum match
length. The results for various length parameters are shown in
the next table. However, the resulting increase in speed was not
substantial.
length | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
65 | 22828 | 32191 | 457176 |
33 | 22928 | 32261 | 457174 |
17 | 23794 | 32726 | 455700 |
9 | 26552 | 34073 | 484471 |
On the other hand, by increasing the number of output bits
(actually, by computing segment sizes with floating point
numbers instead of unsigned longs) one can improve the
compression, as shown in the next table. At least the floating
point operations do not significantly slow it down any further.
The limiting factor now becomes the number of tests needed to
determine a unique first step. This is usually only slightly
larger than the initial match length.
length | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
65 | 22196 | 32035 | 455017 |
33 | 22329 | 32137 | 455261 |
-------+--------------+-------------+--------------+
len | 22522 | 32054 | 461290 |
first | 22595 | 32067 | 479614 |
save | 22170 | 32037 | 456498 |
One way to reduce the number of match searches is to assume
that a match taken from a smaller tree is always preferable to
the same length match taken from a larger tree. This turns out
not to be the case, however, as shown in the above table (as
'len'). Nor does it run much faster.
Alternatively, one could eschew the search for the best
match, taking the first match found. Again, there is little time
saved at the cost of compression efficiency. The results are
shown in the above table (as 'first').
Another possible way to improve speed would be to save the
results after each step (as opposed to starting all over again).
This would mean far fewer searches. However, it would require
performing all of the search tree updates and sorts and saving
enough information to undo them. Because of all the additional
tree and sort operations, this approach is only slightly faster
than the 'best' algorithm. The good news is that the performance
is about the same, as shown in the above table (as 'save').
Since LZW is inherently faster than LZSS, attempts were made
to use LZW variants together with the full search. However, not
only were the results not even in the same ballpark, but the
speed still left a lot to be desired.